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An efficient low-order scaling method is presented for large-scale electronic structure calculations 
based on the density functional theory using localized basis functions, which directly computes 
selected elements of the density matrix by a contour integration of the Green function evaluated 
with a nested dissection approach for resultant sparse matrices. The computational effort of the 
method scales as 0(Ai'(log2 N)'^), O(Af^), and 0{N'^''^) for one, two, and three dimensional systems, 
respectively, where A'^ is the number of basis functions. Unlike 0{N) methods developed so far 
the approach is a numerically exact alternative to conventional 0(A'^'^) diagonalization schemes in 
spite of the low-order scaling, and can be applicable to not only insulating but also metallic systems 
in a single framework. It is also demonstrated that the nested algorithm and the well separated 
data structure are suitable for the massively parallel computation, which enables us to extend the 
applicability of density functional calculations for large-scale systems together with the low-order 
scaling. 
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I. INTRODUCTION 

During the last three decades continuous efForts'^"'^^ 
have been devoted to extend applicability of the den- 
sity functional theory (DFT)ii^ to large-scale systems, 
which leads to realization of more realistic simulations 
being close to experimental conditions. In fact, lots of 
large-scale DFT calculations have already contributed 
for comprehensive understanding of a vast range of 
materials^"— although widely used functional such as 
local density approximation (LDA)^ and generalized 
gradient approximation (GGA)^ have limitation in de- 
scribing strong correlation in transition oxides and van 
der Waals interaction in biological systems. 

The efficient methods developed so far within the con- 
ventional DFT can be classified into two categories in 
terms of the computational complcxity,"^"^^ while the 
other methods, which deviate from the classification, 
have been also proposedi^l^— The first category con- 
sists of 0{N^) methods^'^ where TV is the number 
of basis functions, as typified by the Householder-QR 
methodfiiii^ the conjugate gradient method^'^'^ and the 
Pulay method,^!" which have currently become standard 
methods. The methods can be regarded as numeri- 
cally exact methods, and the computational cost scales 
as 0{N^) even if only valence states are calculated be- 
cause of the orthonormalization process. On the other 
hand, the second category involves approximate 0{N) 
methods such as the density matrix methodji^r^ the 
orbital minimization method )^^'^'^ and the Krylov sub- 
space methodi^iiii^^ of which computational cost is pro- 
portional to the number of basis functions N. The linear- 
scaling of the computational effort in the O(A^) methods 
can be achieved by introducing various approximations 
like the truncation of the density matrix^^ or Wannier 
functionsi^i^^ in real space. Although the 0{N) meth- 
ods have been proven to be very efficient, the applica- 



tions must be performed with careful consideration due 
to the introduction of the approximations, which might 
be one of reasons that the O(A^) methods have not been 
widely used compared to the 0(A^^) methods. From the 
above reason one may think of whether a numerically ex- 
act but low-order scaling method can be developed by 
utilizing the resultant sparse structure of the Hamilto- 
nian and overlap matrices expressed by localized basis 
functions. Recently, a direction towards the development 
of 0(A^^~) methods has been suggested by Lin et al., in 
which diagonal elements of the density matrix is com- 
puted by a contour integration of the Green function cal- 
culated by making full use of the sparse structure of the 
matrix. ■^^ Also, an efficient scheme has been presented 
by Li et al. to calculate diagonal elements of the Green 
function for electronic transport calculations^ which is 
based on the algorithm by Takahashi et al4^ and Eris- 
man and Tinney4i However, except for the two methods 
mentioned above the development of numerically exact 
0(A^^~) methods, which are positioned in between the 
O(A^) and 0(A^^) methods, has been rarely explored yet 
for large-scale DFT calculations. 

In this paper we present a numerically exact but low- 
order scaling method for large-scale DFT calculations of 
insulators and metals using localized basis functions such 
as pseudo-atomic orbital (PAO))^ finite element (FE),— 
and wavelet basis functions4i The computational effort 
of the method scales as 0(A^(log2 iV)^), 0{N'^), and 
0(A^'^/'^) for one, two, and three dimensional (ID, 2D, 
and 3D) systems, respectively. In spite of the low-order 
scaling, the method is a numerically exact alternative 
to the conventional 0{N^) methods. The key idea of 
the method is to directly compute selected elements of 
the density matrix by a contour integration of the Green 
function evaluated with a set of recurrence formulas. It 
is shown that a contour integration method based on 
a continued fraction representation of the Fermi-Dirac 



function'*^ can be successfully employed for the purpose, 
and that the number of poles used in the contour integra- 
tion does not depend on the size of the system. We also 
derive a set of recurrence formulas based on the nested 
dissection'*^ of the sparse matrix and a block LDL^ fac- 
torization using the Schur complementri^ to calculate se- 
lected elements of the Green function. The computa- 
tional complexity is governed by the calculation of the 
Green function. In addition to the low-order scaling, the 
method can be particularly advantageous to the mas- 
sively parallel computation because of the well separated 
data structure. 

This paper is organized as follows: In Sec. II the the- 
ory of the proposed method is presented together with 
detailed analysis of the computational complexity. In 
Sec. Ill several numerical calculations are shown to il- 
lustrate practical aspects of the method within a model 
Hamiltonian and DFT calculations using the PAO basis 
functions. In Sec. IV we summarize the theory and ap- 
plicability of the numerically exact but low-order scaling 
method. 



II. THEORY 

A. Density matrix approach 

Let us assume that the Kohn-Sham (KS) orbital (j)^, 
is expressed by a linear combination of localized basis 
functions {x} such as PAOj— FE,i2, and wavelet basis 
functions^ as: 



N 



^i^W =^c,,,x*(r)7 



(1) 



where N is the number of basis functions. Through- 
out the paper, we consider the spin restricted and k- 
independent KS orbitals for simplicity of notation. How- 
ever, the generalization of our discussion for these cases 
is straightforward. By introducing LDA or GGA for the 
exchange-correlation functional, the KS equation is writ- 
ten in a sparse matrix form: 



He. 



- Oi/O Ci/ , 



(2) 



where Si, is the eigenvalue of state v, c^ a vector con- 
sisting of coefhcients {cj^i}, and H and S are the Hamil- 
tonian and overlap matrices, respectively. Due to both 
the locality of basis functions and LDA or GGA for the 
exchange-correlation functional, both the matrices pos- 
sess the same sparse structure. It is also noted that the 
charge density n(r) can be calculated by the density ma- 
trix p: 



iW =^p»jXj(i-)x»(r)- 



(3) 



By remembering that x is localized in real space, one may 
notice that the product XiXj is non-zero only if they are 



closely located each other. Thus, the number of elements 
in the density matrix required to calculate the charge 
density scales as 0(A''). As well as the calculation of the 
charge density, the total energy is computed by only the 
corresponding elements of the density matrix within the 
conventional DFT as: 

Etot\n,p\ = Tr(pi7km)+ Mi"n(r)ucxt(r) 

'drdr'^'^^^E.M. (4) 

where i?kin is the matrix for the kinetic operator, Woxt 
an external potential, and i?xc an exchange-correlation 
functional. Since the matrix ffkin possesses the same 
sparse structure as that of S*, one may find an alterna- 
tive way that the selected elements of the density matrix, 
corresponding to the non-zero products XiXj^ are directly 
computed without evaluating the KS orbitals. The alter- 
native way enables us to avoid an orthogonalization pro- 
cess such as Gram-Schmidt method for the KS orbitals, of 
which computational effort scales as 0(A^'^) even if only 
the occupied states are taken into account. The direct 
evaluation of the selected elements in the density matrix 
is the starting point of the method proposed in the pa- 
per. The density matrix p can be calculated through the 
Green function G as follows: 



p = Im 

TT 



dSG(£; + iO+)/ 
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(5) 



where the factor 2 is due to the spin degeneracy, / the 
Fermi-Dirac function, /i chemical potential, T electronic 
temperature, fee the Boltzmann factor, and 0^ a positive 
infinitesimal. Also the matrix expression of the Green 
function is given by 



G{Z) = {zs-Hy\ 



(6) 



where Z is a complex number. Therefore, from Eqs. (5) 
and (6), our problem is cast to two issues: (i) how the 
integration of the Green function can be efficiently per- 
formed, and (ii) how the selected elements of the Green 
function in the matrix form can be efficiently evaluated. 
In the subsequent subsections we discuss the two issues 
in detail. 



B. Contour integration of the Green function 

We perform the integration of the Green function, 
Eq. (5) , by a contour integration method using a contin- 
ued fraction representation of the Fermi-Dirac function.— 
In the contour integration the Fermi-Dirac function is ex- 



pressed by 
1 
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where a; = j3{Z — ji) with /3 = tt^Y' ^p ^"^^ Rp ^^^^ poles of 
the continued fraction representation and the associated 
residues, respectively. The representation of the Fermi- 
Dirac function is derived from a hypergeometric function, 
and can be regarded as a Pade approximant when termi- 
nated at the finite continued fraction. The poles Zp and 
residues Rp can be easily obtained by solving an eigen- 
value problem as shown in Ref. [^^]. By making use of 
the expression of Eq. (7) for Eq. (5) and considering the 
contour integration, one obtain the following expression 
for the integration of Eq. (5): 



Ai 



= M(0)+Im -:^^G(ap)i?, 



/? 



(8) 



p=i 



where ap = /i + «%■, and M^^' is the zeroth order mo- 
ment of the Green function which can be computed by 
iRG{iR) with a large real number R. The structure of 
the poles distribution, that all the poles are located on 
the imaginary axis like the Matsubara pole, but the den- 
sity of the poles becomes smaller as the poles go away 
from the real axis, has been found to be very effective 
for the efficient integration of Eq. (5). It has been shown 
that only the use of the 100 poles at 600 K gives numer- 
ically exact results within double precision.*^ Thus, the 
contour integration method can be regarded as a numer- 
ically exact method even if the summation is terminated 
at a practically modest number of poles. 

Moreover, it should be noted that the number of poles 
to achieve convergence is independent of the size of sys- 
tem. Giving the Green function in the Lehmann repre- 
sentation, Eq. (8) can be rewritten by 

p = M(o)+imf-^f;^M(Mi?„^ 






M(")+^Im -5^ 



/3 ^ ap- Cj, 
p—i ^ 



Rp . (9) 



Although the expression in the second line is obtained 
by just exchanging the order of the two summations, 
the expression clearly shows that the number of poles 
for convergence does not depend on the size of system if 
the spectrum radius is independent of the size of system. 



Since the independence of the spectrum radius can be 
found in general cases, it can be concluded that the com- 
putational effort is determined by that for the calculation 
of the Green function. 

The energy density matrix e, which is needed to cal- 
culate forces on atoms within non-orthogonal localized 
basis functions, can also be calculated by the contour 
integration method^^ as follows: 



2 1"°° 
e = Im / dE EG{E + iO+)f 

T^ J-oo 



E- p. 
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= Af (1) + kM^°'^ + Im -^ E G(ap)i?pap 



p=i 



with K defined by 






(10) 
(11) 



where AI^^'> and Af *^^'' are the the zeroth and first order 
moments of the Green function, and can be computed by 
solving the following simultaneous linear equation: 



1 Zo 

1 z; 



A/(") 
A/(i) 



zoG{Zo) 
ziG{Z,) 



(12) 



The equation is derived by terminating the summation 
over the order of the moments in the moment represen- 
tation of the Green function. By letting zq and zi be iR 
and —R, respectively, AT ("^ and Af (^' are explicitly given 
by 






R 

1-i 
JR^ 



{G{iR) - G{-R)) , (13) 

{iGiiR) + Gi~R)), (14) 



where R should be a large real number, and 10^ is used in 
this study so that the higher order terms can be negligible 
in terminating the summation in the moment representa- 
tion of the Green function. Inserting Eqs. (13) and (14) 
into Eq. (10), we obtain the following expression which 
is suitable for the efficient implementation in terms of 
memory consumption: 



4i 



e = XG{iR}+jG{-R)+Imi~—J2G{ap)Rpap 



with A and 7 defined by 
R 



A = -{l + i){l + iKR), 
7 = -|(l + z)(l-Ki?). 



(15) 

(16) 
(17) 



One may notice that the number of poles for convergence 
does not depend on the size of system even for the calcu- 
lation of the energy density matrix because of the same 
reason as for the density matrix. 



C. Calculation of the Green function 

It is found from the above discussion that the computa- 
tional effort to compute the density matrix is governed by 
that for the calculation of the Green function, consisting 
of an inversion of the sparse matrix of which computa- 
tional effort by conventional schemes such as the Gauss 
elimination or LU factorization based methods scales as 
0(A^^). Thus, the development of an efficient method of 
inverting a sparse matrix is crucial for efficiency of the 
proposed method. 

Here we present an efficient low-order scaling method, 
based on a nested dissection approach;^ of computing 
only selected elements in the inverse of a sparse ma- 
trix. The low-order scaling method proposed here con- 
sists of two steps: (1) Nested dissection: by noting that 
a matrix X = {ZS — H) is sparse, a structured ma- 
trix is constructed by a nested dissection approach. In 
practice, just reordering the column and row indices of 
the matrix X yields the structured matrix. (2) Inverse 
by recurrence formulas: by recursively applying a block 
LDL^ factorization^'^ to the structured matrix, a set of 
recurrence formulas is derived. Using the recurrence for- 
mulas, only the selected elements of the inverse matrix 
X^^ = G{Z) are directly computed. The computational 
effort to calculate the selected elements in the inverse ma- 
trix using the steps (i) and (ii) scales as 0(iV(log2 N)'^), 
0(7V2), and 0{N'^^^) for ID, 2D, and 3D systems, re- 
spectively, as shown later. First, we discuss the nested 
dissection of a sparse matrix, and then derive a set of 
recurrence formulas of calculating the selected elements 
of the inverse matrix. 



1. Nested dissection 

As an example the right panel of Fig. 1(c) shows a 
structured matrix obtained by the nested dissection ap- 
proach for a finite chain model consisting of ten atoms, 
where we consider a s-valent nearest neighbor tight bind- 
ing (NNTB) model. When one assigns the number to the 
ten atoms as shown in the left panel of Fig. 1(a), then X is 
a tridiagonal matrix, of which diagonal and off-diagonal 
terms are assumed to be a and b, respectively, as shown 
in the right panel of Fig. 1(a). As the first step to gen- 
erate the structured matrix shown in the right panel of 
Fig. 1 (c) , we make a dissection of the system into the left 
and right domains^ by renumbering for the ten atoms, 
and obtain a dissected matrix shown in the right panel of 
Fig. 1(b). The left and right domains interact with each 
other through only a separator consisting of an atom 10. 
As the second step we apply a similar dissection for each 
domain generated by the first step, and arrive at a nested- 
dissected matrix given by the right panel of Fig. 1(c). The 
subdomains, which consist of atoms 1 and 2 and atoms 3 
and 4, respectively, in the left domain interact with each 
other through only a separator consisting of an atom 5. 
The similar structure is also found in the right domain 



(a) 



123 456 789 10 



(b) 

1 2 3 4 5 10 6 7 8 9 

(c) 

1 2 5 3 4 10 6 7 9 



(d) 



fa b 
b a b 
b a b 
b a b 
b a b 
b a b 
b a b 
b a b 
bob 




\ 


b a 1 


la b 
b a b 
b a b 
b a b 
b a 








a b 
b a b 
b a b 
b a 


W h \b Wall 






lab': I ': 
b a\ [b\ 




b 


\a'bm 
[b..aLA 

rbwm 




a-T 
b a b 
u'b 
b b.a 


W b \b \\a\l 





10 


^ 




\"5 




\"9 




n 

U....?J I?. ...Ij 


[^ZZi 


[Sj 



FIG. 1: (Color online) (a) The initial numbering for atoms 
in a linear chain molecule consisting of ten atoms described 
by the s-valent NNTB and its corresponding matrix, (b) the 
renumbering for atoms by the first step in the nested dissec- 
tion and its corresponding matrix, (c) the renumbering for 
atoms by the second step in the nested dissection and its cor- 
responding matrix, (d) the binary tree structure representing 
hierarchical interactions between domains in the structured 
matrix by the numbering shown in Fig. 1(c). 



consisting of atoms 6, 7, 9, and 8. It is worth mentioning 
that the resultant nested structure of the sparse matrix 
can be mapped to a binary tree structure which indi- 
cates hierarchical interactions between (sub)domains as 
shown in Fig. 1(d). By applying the above procedure to 
a sparse matrix, one can convert any sparse matrix into 
a nested and dissected matrix in general. However in 
practice there is no obvious way to perform the nested 
dissection for general sparse matrices, while a lot of effi- 
cient and effective methods have been already developed 
for the purpose.'**'*^ Here we propose a rather simple 
but effective way for the nested dissection by taking ac- 
count of a fact that the basis function we are interested 
in is localized in real space, and that the sparse struc- 
ture of the resultant matrix is very closely related to the 
position of basis functions in real space. The method 



bisects a system into two domains interacting through 
only a separator, and recursively applies to the resultant 
subdomains, leading to a binary tree structure for the 
interaction. Our algorithm for the nested dissection of a 
general sparse matrix is summarized as follows: 

(i) Ordering. Let us assume that there are Nd basis 
functions in a domain we are interested in. We order 
the basis functions in the domain by using the fractional 
coordinate for the central position of localized basis func- 
tions along a^-axis, where i = 1, 2, and 3. As a result of 
the ordering, each basis function can be specified by the 
ordering number^ which runs from 1 to Nd in the domain 
of the central unit cell. The ordering number in the pe- 
riodic cells specified by /a^, where I = 0, ±1,±2,---, is 
given by INd + q, where q is the corresponding order- 
ing number in the central cell. In isolated systems, one 
can use the Cartesian coordinate instead of the fractional 
coordinate without losing any generality. 

(ii) Screening of basis functions with a long tail. The 
basis functions with a long tail tend to make an efficient 
dissection difficult. The sparse structure formed by the 
other basis functions with a short tail is latescent due to 
the existence of the basis functions with a long tail. Thus, 
we classify the basis functions with a long tail in the do- 
main as members in the separator before performing the 
dissection process. By the screening of the basis func- 
tions with a long tail, it is possible to expose concealed 
sparse structure when atomic basis functions with a va- 
riety of tails are used, while a systematic basis set such 
as the FE basis functions may not require the screening. 

(iii) Finding of a starting nucleus. Among the localized 
basis functions in the domain, we search a basis function 
which has the smallest number of non-zero overlap with 
the other basis functions. Once we find the basis func- 
tion, we set it as a starting nucleus of a subdomain. 

(iv) Growth of the nucleus. Staring from a subdomain 
given by the procedure (iii), we grow the subdomain by 
increasing the size of nucleus step by step. The growth 
of the nucleus can be easily performed by managing the 
minimum and maximum ordering numbers, TOmin and 
"^max, which ranges from 1 to Nd. We define the subdo- 
main by basis functions with the successive ordering num- 
bers between the minimum and maximum ordering num- 
bers TOmin and m-niax- At each step in the growth of the 
subdomain, we search two basis functions which have the 
minimum ordering number rimin and maximum ordering 
number rimax among basis functions overlapping with the 
subdomain defined at the growth step. In the periodic 
boundary condition, n^in can be smaller than zero, and 
'T-max can be larger than the number of basis functions A^^- 
Then, the number of basis functions in the subdomain, 
the separator, and the other subdomain can be calculated 
by No = m,nax-m.min + l, Ns = ninax-".min + l-A^o, and 
Ai = Nd — Nq — Ns, respectively, at each growth step. 
By the growth process one can minimize (|Ao — Ai|-|-A"s) 
being a measure for quality of the dissection, where the 
measure (| Aq — A^i | -I- Ag) takes equal bisection size of the 
subdomains and minimization of the size of the separator 
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FIG. 2: (Color online) (a) The square lattice model, described 
by the s-valent NNTB, of which unit cell contains 1024 atoms 
with periodic boundary condition. The right blue and red 
circles correspond to atoms in two domains and a separator, 
respectively, at the first step in the nested dissection, (b) The 
square lattice model at the final step in the nested dissection. 
The separator at the innermost and the outermost levels are 
labeled as separators and 5, respectively, and the separators 
at each level are constructed by atoms with a same color. 



into account. Also, if (nmax — 'Trnin + l) is larger than Nd, 
then this situation implies that the proper dissection can 
be difficult along the axis. 

(v) Dissection. By applying the above procedures (i)- 
(iv) to each a^-axis, where i = 1, 2, and 3, and we can find 
an axis which gives the minimum ( | A"o ^ A^i | + A^g ) . Then, 
the dissection along the axis is performed by renumber- 
ing for basis functions in the domain, and two subdo- 
mains and one separator are obtained. Evidently, the 
same procedures can be applied to each subdomain, and 
recursively continued until the size of domain reaches the 
threshold. As a result of the recursive dissection, we ob- 
tain a structured matrix by the nested dissection. 

As an illustration we apply the method for the nested 
dissection to the finite chain molecule shown in Fig. 1. 
We first set all the system as domain, and start to apply 
the series of procedures to the domain. The procedure 
(i) is trivial for the case, and we obtain the numbering of 
atoms and the corresponding matrix shown in Fig. 1(a). 
Also it is noted that the screening of the basis functions 
with a long tail is unnecessary, and that we only have to 
search the chain direction. In the procedure (iii), atoms 
1 and 10 in Fig. 1(a) satisfy the condition. Choosing 
the atom 1 as a starting nucleus of the domain, and we 
gradually increase the size of the domain according to 
the procedure (iv). Then, it is found that the division 
shown in Fig. 1(b) gives the minimum (lA^o — A^i| + Ag). 
Renumbering for the basis functions based on the analy- 
sis yields the dissected matrix shown in the right panel of 
Fig. 1(b). By applying the similar procedures to the left 
and right subdomains, one will immediately find the re- 
sult of Fig. 1(c). In addition to the finite chain molecule, 
as an example of more general cases, the above algo- 



rithm for the nested dissection is applied to a s-valent 
NNTB square lattice model of which unit cell contains 
1024 atoms with periodic boundary condition. At the 
first step in the nested dissection, the separator is found 
to be red atoms as shown in Fig. 2(a). Due to the pe- 
riodic boundary condition, the separator consists of two 
lines. At the final step, the system is dissected by the 
recursive algorithm as shown in Fig. 2 (b). The separator 
at the innermost and the outermost levels are labeled as 
separators and 5, respectively, and each subdomain at 
the innermost level includes 9 atoms. As demonstrated 
for the square lattice model, the algorithm can be applied 
for systems with any dimensionality, and provides a well 
structured matrix for our purpose in a single framework. 



2. Inverse by recurrence formulas 

We directly compute the selected elements of the in- 
verse matrix using a set of recurrence formulas which 
can be derived by recursively applying a block LDL^ 
factorization to the structured matrix obtained by the 
nested dissection method as shown below. To derive the 
recurrence formulas, we first introduce the block LDLF 
factorization^^, for a symmetric square matrix X: 



X 



A B^ 
B C 

I 
L I 



A 
S 



I L'T 
/ 



(18) 



where A and C are diagonal block matrices, and B and 
B"^ an off-diagonal block matrix and its transposition, 
and L is given by 



L = BA- 



(19) 



Also the Schur complement S of the block element C is 
defined by 



S=C- BA-^B'^ = C - BL^ 



(20) 



Then, it is verified that the inverse matrix of X is given 
by 



X-i = 






(21) 



We now consider calculating the selected elements of the 
inverse of the structured matrix given in Fig. 1(c) using 
Eq. (21), and rewrite the matrix in Fig. 1(c) in a block 
form as follows: 



X = 



B^fi 



( ^0,0 

-Bo,0 ^0.1 C'o.o 



-Bi.o 



\ 



-^1,0 






^0,2 

^0,3 ^0,3 -^1,1 
^0,2 So, 3 Co,l 

-Bi,i Ci^o / 



(22) 



where Aq^ and -Bq.o correspond to 



and (0,6), re- 



spectively, and the other block elements can be deduced. 
Also the blank indicates a block zero element. Using 
Eq. (20) the Schur complement of Ci^o is given by 



'S'l^o — C'l^o ^ ^1, 0-^1,0 



^1, 1-^1,1' 



(23) 



where L^q is calculated by Eq. (19) and can be trans- 
formed using Eq. (21) to a recurrence formula as follows: 



J'^ - 



^0,0 

^0,1 
So,0 So,l Co,0 



^0,1 



■^1.0 



"^0.0 



A 



0.1 



■^1.0 



-^0 0*^0.0^0,0 -^n n^n n^n 



^0,0*^0, 0^0,1 ^0,0*5^0.0 



-^0,l'5'o,0-^0,0 -^0,1'S'o n-^0,1 — -^0,l'5'o.O 




with the definitions: 

•^1,0,0 

<o.i 
and 



^o,o(-^i,o[-Bo,o]) , 
^o',i(-Si,o[-Bo,i]) , 



^1,0 



(24) 



(25) 
(26) 



*3i,i,o - 'S'o.o (-So,oV"i_o,o + So,iV"i_o,i " (5i,o[Co,o]) ) ■ 

(27) 

In Eqs. (25), (26), and (27), we used a bra-ket notation 
[ ] which stands for a part of the block element. For 
example, i?i.o[-Bo.o] means a part of i?i.o which has the 
same columns as those of Bqq. It is noted that one can 
obtain a similar expression for L\^ as well as Eq. (24) 
for L:[o. 

To address a more general case where the dissection 
for the sparse matrix is further nested, we suppose that 
the matrix Aq^o has the same inner structure as 




then one may notice the recursive structure in Eq. (24), 
and can derive the following set of recurrence relations 
for general cases: 



'^p.m+l.n " 
{Bm.2nV„ , 



^m,n ^ 



p,7n,2n 



Bm,2n+lVp„^2n+l 



[Bp^q[Cm,n\j j , 

(28) 



•^ p,Tn-\-l,n 



K 










^m,2n 



Ql. 



rn+l,^-" 



(29) 



Equation ((29)) is the central recurrence formula coupled 
with Eq. ([25)) . where the initial block elements are given 
by 



p,v,n 



(Aqji) (-Bp,q[-Bo,n]) ■ 

Also Lp_n and 5p_„ can be calculated by 

^p.n — *^p,p,n5 



^p,n — (-'p,n ~ [^p,2n: J^p,2n+l) I tT 



i,2n 



^p,2n+l 



(30) 

(31) 
(32) 



A set of Eqs. ()^ -(32) enables us to calculate all the 
inverses of the Schur complements S and L. In the re- 
currence equations Eqs. (|28)) and (|29)) . three indices of p, 
771, and 77 are involved, and they run as follows: 



P = 0,---,P. 

771 — 0, • • • ,p — 1. 
77 = 0, ••• ,2 



P-T 



1. 



(33) 

(34) 
(35) 



The index p denotes the level of hierarchy in the nested 
dissection and the innermost and outermost levels are set 
to and P, respectively. Then, it is noted that the total 
system is divided into 2^"*"^ domains at the innermost 
level. As well as p the index 777 is also related to the level 
of hierarchy in the nested dissection, and runs from to 
p — 1. The index 77. is a rather intermediate one, being 
dependent on ttt,. The indices 77 in Eq. (30) is dependent 
on p and q and they run as follows: 



q{2n 



,2^+1-p-i. 



1. 



(36) 

(37) 



Since the set of the recurrence formulas Eqs. ([28|)-(32) 
proceed according to Eqs .(33)-(35), the development of 
recurrence can be illustrated as in Fig. 3. The recurrence 
starts from Eq. (30) with p = 0, and Eqs. (31) and (32) 
follow. Then, p is incremented by one, and 777 + 1 climbs 
up to 1 . The increment of p and the climbing of 777 + 1 are 
repeated until p ^ P and 777+1 = P. At777+l=p for 
each p, L and S are evaluated by Eqs. (31) and (32), and 
the inverse of S is calculated by a conventional method 
such as LU factorization, which are used in the next re- 
currence for the higher level of hierarchy. The numbers 
in the right hand side of Fig. 3 give the multiplicity for 
similar calculations by Eq. ()29p coming from the index 
77 at each 777 + 1, since 77 runs from to 2^~™ — 1 as 
given in Eq. (35). The computational complexity can be 
estimated by Fig. 3, and we will discuss its details later. 
We are now ready to calculate the selected elements of 
the Green function using the inverses of the Schur com- 
plements S and L calculated by the recurrence formulas 
of Eqs. (|28)) -(32). By noting that Eq. (21) has a recursive 
structure and the matrix X is structured by the nested 
dissection, one can derive the following recurrence for- 



m+1 




FIG. 3: (Color online) The development of recurrence formu- 
las Eqs. (15S}-(32), which implies that the recurrence starts 
from p = m + 1 = and ends at p = 771 + 1 = P. The 
number in the right hand side is the multiplicity for similar 
calculations by Eq. H29[) due to the index n at each m+ 1. 



mula: 



X 



X 



-1 

p+l^n 



p.2n 



X 



-1 

p,2n+l 







VT T 
^p,2n^P;2n 



-K 



J^- 



■^,2n+l^P.2n+l ^,2n+l | ;(38) 

^Yp^2n -~Yp,2n+l 






where 



Y' 



p.27i 

rT _ tT 



^p,2n^p,n^ 



-'p.2n+l ^ ^p,2n+l'~'p,n- 



(39) 



The recurrence formula Eq. (38) starts with Xq ^ = 
{Ao.n)~^, adds contributions at 777 + 1 — p for every 
p, and at last yields the inverse of the matrix X as 
X~^ = G{Z) = Xp^j^ g. Since the calculation of each 
element for the inverse of X can be independently per- 
formed, only the selected elements can be computed with- 
out calculating all the elements. The selected elements 
to be calculated are elements in the block matrices A, B, 
and C, each of which corresponds to a non-zero overlap 
matrix as discussed before. Thus, we can easily compute 
only the selected elements using a table function which 
stores the position for the non-zero elements in the block 
matrices A, B, and C. 

A simple but nontrivial example is given in Appendix 
A to illustrate how the inverse of matrix is computed 
by the recurrence formulas, and also a similar way is 
presented to calculate a few eigenstates around a selected 
energy in Appendix B, while the proposed method can 
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calculate the total energy of system without calculating 
the eigenstates. 



3. Finding chemical potential 

As well as the conventional DFT calculations, in the 
proposed method the chemical potential has to be ad- 
justed so that the number of electrons can be conserved. 
However, there is no simpler way to know the number of 
electrons under a certain chemical potential before the 
contour integration by Eq. (8) with the chemical poten- 
tial. Thus, we search the chemical potential by iterative 
methods for the charge conservation. Since the contour 
integration is the time-consuming step in the method, 
a smaller number of the iterative step directly leads to 
the faster calculation. Therefore, we develop a careful 
combination of several iterative methods to minimize the 
number of the iterative step for sufficient convergence. In 
general, the procedure for searching the chemical poten- 
tial can be performed by a sequence (l)-(2) or (5)-(l)- 
(3)-(l)-(4)-(l)-(4)-(l)- • • in terms of the following proce- 
dures. As shown later, the procedure enables us to obtain 
the chemical potential conserving the number of electrons 
within 10~^ electron/system by less than 5 iterations on 
an average. 

(1) Calculation of the difference AiVp in the total num- 
ber of electrons. The difference l^Ni in the total number 
of electrons is defined with /o(^i) calculated using Eq. (8) 
at a chemical potential /j.^ by 



AA^, = Tr(p(^,)5)-iVi 



ideal: 



(40) 



where Mdcai is the number of electrons that the system 
should possess for the charge conservation. If AAq is 
zero, the chemical potential /io is the desired one of the 
system. 

(2) Using the retarded Green function. If the difference 
AAq is large enough so that the interpolation schemes 
(3) and (4) can fail to guess a good chemical potential, 
the next trial chemical potential is estimated by using the 
retarded Green function. When the chemical potential of 
/itri is considered, the correction 6Ntri to AA^i estimated 
by the retarded Green function is given by 



5N,,, = 



dESp{E)AfiE,fitri), 



Bmin 



where 5p{E) and Af{E, /xtri) are defined by 

Sp{E) = ImTr {G{E + iTj)S) 

n 

with a small number rj (0.01 eV in this study) and 



A/(i^,A^tH) = f[^^^)-f^''-^' 



kBT 



ksT 



(41) 



(42) 



(43) 



The integration in Eq. (41) is numerically evaluated by 
a simple quadrature scheme such as trapezoidal rule 



with a similar number of points as for that of poles in 
Eq. (8), and the integration range can be determined 
by considering the surviving range of Af{E,fitn)- The 
search of /itri is performed by a bisection method un- 
til AA'cri > {ANi +6Ntri), where AA'cri is a criterion for 
the convergence, and 10^^ electron/system is used in this 
study. It should be noted that the evaluation of Green 
function being the time-consuming part can be performed 
before the bisection method and a set of 5p{E) is stored 
for computational efficiency. 

(3) Linear interpolation/ extrapolation method. In 
searching the chemical potential /i, if two previous re- 
sults (/ii, ANi) and (/ij, ANj) are available, a trial chem- 
ical potential /itri is estimated by a linear interpola- 
tion/extrapolation method as: 



Mtri 



PjAN, - ^i.ANj 
Pi - fij 



(44) 



(4) Muller metho($^^. In searching the chemical po- 
tential jjL., if tree previous results {pi,ANi), (nj^ANj), 
and {pk,ANk) are available, they can be fitted to a 
quadratic equation: 



AA^ = ap^ + bp + c, 



(45) 



where a, 6, and c are found by solving a simultaneous 
linear equation of 3 x 3 in size.^^ Then, /itri giving A A^ = 
is a solution of Eq. (45), and given by 






6>0, 
6<0. 



(46) 



The selection of sign is unique because of the condition 
that the gradient at the solution must be positive, and 
the branching is taken into account to avoid the round-off 
error. As the iteration proceeds in search of the chem- 
ical potential, we have a situation that the number of 
available previous results is more than three. For the 
case, it is important to select three chemical potentials 
having smaller AA^ and the different sign of AA^ among 
three chemical potentials, since the guess of /itri can be 
performed as the interpolation rather than the extrapo- 
lation. 

(5) Extrapolation of chemical potential for the sec- 
ond step. During the self-consistent field (SCF) 
iteration, the chemical potential obtained at the 
last SCF step is used as the initial guess pi in 
the current SCF step. In addition, we estimate 
the second trial chemical potential by fitting results 
{p^^\An[^p^^\a4% ip['\ANi'\pi'\A4% and 
(Pi ,AN{^ , P2 jAA'j ), where the subscript and the 
superscript in Pq and AA'q mean the iteration step in 
search of the chemical potential and the SCF step, re- 
spectively, at three previous SCF steps to the following 
equation: 

AN2 = aiAA^i + a2{p2 - Pi) + a3ANi{p2 - pi), (47) 



TABLE I: Some of nS^ and iV^^^ in Eq. (50) for a finite ID 
chain, a finite 2D square lattice, and a finite 3D cubic lattice 
described by the s-valent NNTB model. They depends on m 
or p for the 2D and 3D systems in a rather complicated way, 
while Np^}ri,n = oP^m for ^U the cases. The unit for each case 
is given in parenthesis. 



m+1 or p P P-1 P-2 P-3 P-4 P-5 P-6 P-7 P-8 P-9 P-10 


ID (1) 1 1 

2D (Afi/2) 1 i 
3D (iV2/3) 1 1 


1 

1 

2 
1 
4 


1 

1 
4 
1 
4 


1 

1 
4 
1 

8 


11111 1 
11111 1 
8 8 16 16 32 32 
11111 1 
16 Ifi 32 64 R4 128 



(3) 

and Np is the dimension of column in the matrix 



Qvm+i n- Since Eq. ([^ consists of a matrix product, 

(1) w(2) Ar(3) 



the computational cost is simply given by Nm' N, 



'K 



Also it is noted that Nm and Nm depend on only m, 

(3) 

and Np has dependency on only p because of the sim- 
plicity of the systems we consider. 

For the finite ID chain system, we see that Nm = 
N/{2^-"^) and A^i^' = N^'^^ = 1 as listed in Table I. 
Thus, the computational cost iiD for the ID system is 
estimated as 



where oi, 02, and 03 are found by solving a simultaneous 
linear equation of 3 x 3 in size. Then, the chemical po- 
tential ^2 giving AN2 — can be estimated by solving 
Eq. (47) with respect to /i2 as follows: 



Mtri = M2 = Ml 



aiAAi 



02 + a^ANi 



(48) 



p p-i 2^-"- 

^iD « E E E 2P 

p—l m—O n— 

= ^NP{P+l). 



N 



(50) 



It is found from numerical calculations that Eq. (48) pro- 
vides a very accurate guess in most cases as the SCF 
calculation converges. 



D. Computational complexity 

We analyze the computational complexity of the pro- 
posed method. As discussed in the subsection Contour 
integration of the Green function, the number of poles 
for the contour integration is independent of the size of 
system. Thus, we focus on the computational complexity 
of the calculation of the Green function. For simplicity 
of the analysis we consider a finite chain, a finite square 
lattice, and a finite cubic lattice as representatives of ID, 
2D, and 3D systems, respectively, which are described by 
the s-valent NNTB models as in the explanation of the 
nested dissection. Note that the results in the analysis 
are valid for more general cases with periodic boundary 
conditions. Since the computational cost is governed by 
Eq. (1^^ . let us first analyze the computational cost of 
Eq. (1^^ . while those of the other equations will be dis- 
cussed later. Considering that the recurrence formula of 
Eq. p9| develops as shown in Fig. 3, and that the cal- 
culation of Eq. (I^ni) corresponds to the open circle in the 
figure, the computational cost t can be estimated by 



Noting A^ c>c 2^, we see that the computational cost for 
the ID system scales as 0(A^(log2(A))2). 

For the finite 2D square lattice system, we see Am — 



r(2) 



(3) 



N/{2 '"), and A'^ and Np depend on m and p, re- 
spectively as shown in Table L To estimate the order of 

(2) (3) 

the computational cost we approximate Nm and A'p 

as ivi'^ « ^i/2/2^(^-'"-i) and iV^'^ « ^i/2/2^(^-p) 
which are equal to or more than the corresponding exact 
number. Then, the computational cost i2D for the 2D 
system can be estimated as follows: 



P p-l 2^"™-l 

*2D oc E E E 

p—l m—O n— 
P p-l 2-^""'-! 

<j:e e 



A^ 






p—l m—O n— 

2N^ 



cyP—r}i PiTn^n p,m,n 

N JVi/2 JVl/2 

2P-m 2i(P-m-l) 2HP-P) 

1 \ 



(72-1)2 



2-^/2 + ^-i 



2P 2P 2^/2 



(51) 



p p-i 2^ 



^-EE E 

p—l m—O n— 



iV(i)Ar(2)7v(3) 

m ra p ' 



Since the first twos term in parenthesis of the last line 
(49) are the leading term, we see that the computational cost 
for the 2D system scales as 0(A^^). 



where Nm and Am are the dimension of row and col- 
umn in the matrix: 



/ L^ 



m,2n 
^m,2n+l 



V -I 



For the finite 3D cubic lattice system we have A^m = 
Ar/(2^-'") as weU as the ID and 2D systems. As 
shown in the analysis of the 2D systems, by approxi- 



r(2) 



and Ni^'' as A, 



vp 



r(2) 



Ar2/3/2i(P-™^i) ^^^ 



mating N„ 

the corresponding exact number, we can estimate the 



A^^'^'' w A^2/3^23(^ p), which are equal to or more than 



10 



TABLE II: Computational order of Eqs. ^, (HSJ, pO)! . 
((32}, (|38p . and (|39p . where the calculation of the inverse of 
the matrix S is also included in estimating the computational 
cost of Eq. ((32} , and the sparse structure in the matrix B is 
taken into account for Eqs. (1281) and ((321) • 





ID 


2D 


3D 


Eq. ((28} 


i^og, Nf 


iV^/2 log^ iv 


iV2 


Eq. ((23) 


N{log^ Nf 


iV2 


iV^/3 


Eq. ((30} 


Nlog^N 


iV3/2 


iV5/3 


Eq. ((32} 


log.N 


A/- 


^4/3 


Eq. ((38} 


N 


iV3/2 


J\[^/^ 


Eq. ((M} 


Nlog^N 


iV2 


7V7/3 



computational cost tso for the 3D system as follows: 

P p-l 2-^"'"-! 



*3D OC ^^ ^ 






p—1 m—O n—0 
PP_1 2-^-1 ^ ^2/3 ^2/3 

< EE E 



p^l m—O n—0 
4iV7/3 



22/36 - 9 
1 



2P-m 2^{P-m-l) 2iiP-P) 
1 



1 -L 2^/3 _ 

+ ^ 22/324^/3 

22/3 1 



22/322P/3 22^/3 24^/3 



(52) 



Since we see that the first two terms in parenthesis of the 
last line are the leading term, it is concluded that the 
computational cost for the 3D system scales as 0(iV^/'^). 
We further analyze the computational cost of the other 
Eqs. (EHj), ((201), (EH), ((SHI, and ((23) which are the pri- 
mary equations for the calculation of the Green function. 
Although the detailed derivations are not shown here, 
they can be derived in the same way as for Eq. (|29p . 
Table II shows the order of the computational cost for 
each equation. It is found that the computational cost 
is governed by Eq. ((29| . while the computational cost 
of Eq. ((391) is similar to that of Eq. ^. Thus, it is 
concluded that as a whole the proposed method scales 
as 0(7V(log2 N)^), 0{N^), and 0(iVV3) ^^ ij^^ 2D, and 
3D systems, respectively.— 
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FIG. 4: (Color online) The elapsed time of the inverse cal- 
culation by Eqs. ((28}- (32) for a ID linear chain, a 2D square 
lattice, and a 3D cubic lattice systems as a function of number 
of atoms in the unit cell under periodic boundary condition. 
The Hamiltonian of the systems are described by the s-valent 
NNTB models. The line for each system is obtained by a 
least square method, and the computational orders obtained 
from the fitted curves are 0(iV°-^''(log2 iV)^), 0(iV^-^"), and 



0(iV 



2.35\ 



for the ID, 2D, and 3D systems, respectively. The 



size of domains at the innermost level is set to 20 for all the 



the abbreviation of basis function such as C5.0-slpl rep- 
resents that C stands for the atomic symbol, 5.0 the cut- 
off radius (bohr) in the generation by the confinement 
scheme, slpl means the employment of one primitive or- 
bitals for each of s and p orbitals4^ Since the PAO basis 
functions are pseudo-atomic orbitals with different cutoff 
radii depending on atomic species, the resultant Hamilto- 
nian and overlap matrices have a disordered sparse struc- 
ture, reflecting the geometrical structure of the system. 
Norm-conserving pseudopotentials are used in a separa- 
ble form with multiple projectors to replace the deep core 
potential into a shallow potential. ^^ Also a local density 
approximation (LDA) to the exchange-correlation poten- 
tial is employed)^ 



A. Scaling 



III. NUMERICAL RESULTS 

In the section several numerical calculations for the 
s-valent NNTB model and DFT are presented to illus- 
trate the low-order scaling method. All the DFT calcu- 
lations in this study were performed by the DFT code 
OpenMX^ The PAO basis functionsi^ used in the DFT 
calculations are specified by II4.5-sl, C5.0-slpl, N4.5- 
slpl, 04.5-slpl, and P6.0-slpldl for deoxyribonuleic 
acid (DNA), C4.0-slpl for a single Ceo molecule, and 
Pt7.0-s2p2(il for a single Ptgs cluster, respectively, where 



As shown in the previous section, it is possible to re- 
duce the computational cost from 0{N^) to the low-order 
scaling without losing numerical accuracy. Here we val- 
idate the theoretical scaling property of the computa- 
tional effort by numerical calculations. Figure 4 shows 
the elapsed time required for the calculation of inverse of 
a ID linear chain, a 2D square lattice, and a 3D cubic 
lattice systems as a function of number of atoms in the 
unit cell under periodic boundary condition, which are 
described by the s-valent NNTB models. The last three 
points for each system are fitted to a function by a least 
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FIG. 5: (Color online) The norm of residual in the SCF cal- 
culation of DNA, with a periodic double helix structure (650 
atoms/unit) consisting of cytosines and guanines, calculated 
by the conventional and proposed methods, where the resid- 
ual is defined as the difference between the input and output 
charge densities in momentum space. The electric temper- 
ature of 700 K and 80 poles for the contour integration are 
used. The number in parenthesis is the total energy (Hartree) 
of the system calculated by each method. 
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FIG. 6: (Color online) The number of iterations for searching 
the chemical potential which conserves the total number of 
electrons within a criterion of 10~* electron/system for a Ceo 
molecule, DNA, and a Ptes cluster, where the electric tem- 
perature of 600, 700, and 1000 K, and 80, 80, and 90 poles for 
the contour integration are used for the Ceo molecule, DNA, 
and the Ptes cluster, respectively. 



square method, and the obtained scalings of the elapsed 
time are found to be O(iV"-90(log2 TV)^), 0{N^-^°), and 
0(A^2.35^ for the ID, 2D, and 3D systems, respectively. 
Thus, we confirm that the scaling of the computational 
cost is nearly the same as that of the theoretical estima- 
tion. 



B. SCF calculation 

To demonstrate that the proposed method is a numer- 
ically exact method even if the summation in Eq. (8) is 
terminated at a modest number of poles, we show the 
convergence in the SCF calculations calculated by the 
conventional diagonalization and the proposed methods 
for deoxyribonuleic acid (DNA) in Fig. 5, where 80 poles 
is used for the summation, and the electronic temper- 
ature is 700 K. It is clearly seen that the convergence 
property and the total energy are almost equivalent to 
those by the conventional method with only 80 poles. 



C. Iterative search of chemical potential 

Although the computational cost of the proposed 
method can be reduced from the cubic to low-order scal- 
ings, the prefactor directly depends on the number of it- 
erations in the iterative search of the chemical potential. 
To address how the combination of interpolation and ex- 
trapolation methods discussed before works to search a 
chemical potential which conserves the total number of 



electrons within a criterion, we show in Fig. 6 the num- 
ber of iterations for finding the chemical potential, con- 
serving the total number of electrons with a criterion of 
10~^ electron/system, as a function of the SCF step for 
a Cgo molecule, DNA, and a Ptga cluster. Only few iter- 
ations are enough to achieve a sufficient convergence of 
the chemical potential as the SCF calculation converges, 
while a larger number of iterations are required at the ini- 
tial stage of the SCF step. It turns out that the proper 
chemical potential can be searched by the mean itera- 
tions of 2.1, 2.4, and 4.0 for a Ceo molecule, DNA, and a 
Ptes cluster, respectively. The property of the iterative 
search is closely related to the energy gap of systems. 
The energy gap between the highest occupied and low- 
est unoccupied states of the Cgo molecule, DNA, and 
Ptes cluster are 1.95, 0.67, and 0.02 eV, respectively. For 
the Ceo molecule and DNA with wide gaps the number 
of iterations for finding the chemical potential tends be 
large up to 10 SCF iterations, since the interpolation or 
extrapolation scheme may not work well due to the exis- 
tence of the wide gap. However, once the charge density 
nearly converges, the approximate chemical potential in 
between the gap, which is the correct chemical poten- 
tial at the previous SCF step, can satisfy the criterion 
of 10^* electron/system. The situation does correspond 
to a small number of iterations after 10 SCF iterations. 
Even the trial chemical potential at the first step is the 
correct one within the criterion after 26 SCF iterations in 
these cases. For the Ptga cluster with the narrow gap the 
number of iterations for finding the chemical potential is 
slightly lower than those of the a Cgo molecule and DNA 
with the wide gaps at the initial stage of SCF iterations. 



12 



400 



■| 300 

DC 

a. 

i 200 

CD 

a. 
(f) 



100 



1 thread 

2 threads 
4 threads 
Conventional (1 thread) 




60 80 100 120 140 160 180 
Number of Processes 



FIG. 7: (Color online) Speed-up ratio in the parallel compu- 
tation of the diagonalization in the SCF calculation for DNA 
by a hybrid scheme using MPI and OpenMP. The speed-up 
ratio is defined by 2T2/Tp, where T2 and Tp are the elapsed 
times obtained by two MPI processes and by the correspond- 
ing number of processes and threads. The structure of DNA 
is the same as in Fig. 5. The parallel calculations were per- 
formed on a Cray XT5 machine consisting of AMD opteron 
quad core processors (2.3 GHz). The electric temperature of 
700 K and 80 poles for the contour integration are used. For 
comparison, the speed-up ratio for the parallel computation of 
the conventional scheme using Householder and QR methods 
is also shown for the case with a single thread. 



which implies that the interpolation and extrapolation 
schemes by the procedures (3), (4), and (5) can give a 
good estimation of the chemical potential for the nearly 
continuous eigenvalue spectrum. In addition to this, one 
may find that in contrast to the cases with the wide gap, 
the correct chemical potential is found by two iterations 
as the charge density converges, since a little change of 
the chemical potential affects the distribution of charge 
density due to the narrow gap. However, the fact that 
only two iterations are sufficient even for the system with 
a narrow gap at the final stage of the SCF step suggests 
that the extrapolation by the procedure (5) works very 
well. Thus, we see from the numerical calculations that 
the correct chemical potential can be searched by only 
few iterations on an average with the combination of in- 
terpolation and extrapolation methods for systems with 
a wide variety of gap. 



D. Parallel calculation 

We demonstrate that the proposed method is suitable 
for the parallel computation because of the well separated 
data structure. It is apparent that the calculation of the 
Green function at each Up in Eq. (8) can be independently 
performed without data communication among proces- 
sors. Thus, we parallelize the summation in Eq. (8) by 



using the message passing interface (MPI) in which a 
nearly same number of poles are distributed to each pro- 
cess. The summation in Eq. (8) can be partly performed 
in each process, and the global summation is completed 
after all the calculations allocated to each process finish. 
In most cases the global summation can be a very small 
fraction of the computational time even including the 
MPI communication, since the amount of the data to be 
communicated is 0(iV) due to the use of localized basis 
functions. In addition to the parallelization of the sum- 
mation in Eq. (8), the calculation of the Green function 
can be parallelized in two respects. In the recursive cal- 
culations of Eqs. ((28|) -(32). one may notice that the cal- 
culation for different n is independently performed, and 
also the calculations involving V'^ and L^ in Eqs. ((28|)- 
(32) can be parallelized with respect to the column of 
V^ and L^ without communication until the recurrence 
calculations reach &t m + 1 = p. For each p the MPI 
communication only has to be performed &t m + \ — p. 
In our implementation only the latter part as for the cal- 
culation of the Green function is parallelized by a hybrid 
parallelization using MPI and OpenMP, which are used 
for internodes and intranode parallelization. As a whole, 
we parallelize the summation in Eq. (8) using MPI and 
the calculations involving V'^ and L-^ in Eqs. (|28p -(32) 
using the hybrid scheme. 

Figure 7 shows the speed-up ratio by the parallel cal- 
culation in the elapsed time of one SCF iteration. The 
speed-up ratio reaches about 350 and the elapsed time 
obtained is 3.76 sec using 81 processes and 4 threads, 
which demonstrates the good scalability of the proposed 
method. On the other hand, the conventional diagonal- 
ization using Householder and QR methods scales up to 
only 21 processes, which leads to the speed-up ratio of 
10 and the elapsed time of 7.09 sec. Thus, we see that 
the proposed method is of great advantage to the parallel 
computation unlike the conventional method, while the 
comparison of the elapsed time suggests that the prefac- 
tor in the computational effort for the proposed method 
is larger than that of the conventional method. 



IV. CONCLUSIONS 

An efficient low-order scaling method has been devel- 
oped for large-scale DFT calculations using localized ba- 
sis functions such as the PAO, FE, and wavelet basis 
functions, which can be applied to not only insulating but 
also metallic systems. The computational effort of the 
method scales as 0{N{\og^Nf), 0{N^), and 0(7V'^/3) 
for ID, 2D, and 3D systems, respectively. The method 
directly evaluates based on two ideas only selected ele- 
ments in the density matrix which are required for the 
total energy calculation. The first idea is to introduce 
a contour integration method for the integration of the 
Green function in which the Fermi-Dirac function is ex- 
pressed by a continued fraction. The contour integration 
enables us to obtain the numerically exact result for the 
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integration within double precision at a modest number 
of poles, which allows us to regard the method as a nu- 
merically exact alternative to conventional O(iV^) diago- 
nalization methods. It is also shown that the number of 
poles needed for the convergence does not depend on the 
size of the system, but the spectrum radius of the system, 
which implies that the number of poles in the contour in- 
tegration is unconcerned with the scaling property of the 
computation cost. The second idea is to employ a set of 
recurrence formulas for the calculation of the Green func- 
tion. The set of recurrence formulas is derived from a re- 
cursive application of a block LDL^ factorization using 
the Schur complement to a structured matrix obtained by 
a nested dissection for the sparse matrix {ZS — H). The 
primary calculation in the recurrence formulas consists 
of matrix multiplications, and the computational scaling 
property is derived by the detailed analysis for the cal- 
culations. The chemical potential, conserving the total 
number of electrons, is determined by an iterative search 
which combines several interpolation and extrapolation 
methods. The iterative search permits to find the chem- 
ical potential by less than 5 iterations on an average for 
systems with a wide variety of gap. The good scalabil- 
ity in the parallel computation implies that the method 
is suitable for the massively parallel computation, and 
could extend the applicability of DFT calculations for 
large-scale systems together with the low-order scaling. 



structured matrix: 



X 



(l 1 \ 


1 1 1 


1 1 1 


1 1 


1 1 1 


1 1 1 


[ 1 1 lyl 



(Al) 



where the blank means a zero element. It can be seen 
that the strucutre is same as in Eq. ([22|) . and the system 
is divided into four domains with P = I. Then, we start 
from Eqs. §0^ with p = 0, 



(A2) 
(A3) 
(A4) 
(A5) 



^0,0,0 - (^o,o) {Bofi[Bofi]) 

= 1x1 = 1 = L^o, 

= Ix 1 = 1 = Ll„ 

= 1X1 = 1 = L^2, 



vl 



0,3 



1 X 1 = 1 = L, 



0,3' 



and proceed to calculate Eq. ([5^ . 
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Appendix A: AN EXAMPLE OF THE INVERSE 
CALCULATION 



Since the proposed method to calculate the inverse of 
matrix is largely difference from conventional methods, 
we show a simple but nontrivial example to illustrate 
the calculation of the inverse by using the set of recur- 
rence formulas Eq. ([281) -(|32]), ([MJ, and (|39l), which may 
be useful to understand how the calculation proceeds. 
We consider a finite chain molecule consisting of seven 
atoms described by the same s-valent NNTB model as in 
the subsection Nested dissection^ where all the on-site en- 
ergies and hopping integrals are assumed to be 1. After 
performing the nested dissection, we obtain the following 



'S'0,0 — ^0,0 — BoflL^Q — _Bo, 1-^0,1 

= 1-1x1-1x1 = -1, 

'S'0,1 — Cos — -So, 2-^0, 2 ~ ^0,3-^0,3 

= 1-1x1-1x1 = -!. 



(A6) 



(A7) 



X-^ Q and X^ -^ which are precursors of the inverse of X 



can be calculated by Eq. ()38|) and (|39l) as 
^0.1 



^0.0 



X 



1,0 



^0,0-^0,0 * 

^0,1-^0 



-^0,0 




where * means that the corresponding element is not cal- 
culated, and remains unknown, since these elements are 
not referred for further calculations. The precursor of 
X^^ is found to be same as A"i.o due to the same inner 
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structure. As the next step, we set p to 1, and calculate 
Eq. dSOD, 

^1,0,0 — (^o,o)~ (5i,o[-Bo,o]) 

= 1x0 = 0, (A9) 

Vi^,i = (^oa)-\Si,o[Boa])^ 

= 1x1 = 1, (AlO) 

VlA2 = (^0,2)-'(Sl,l[i3o,2])^ 

= 1x1 = 1, (All) 

= 1x0 = 0, (A12) 



Eq. dSHD, 

*5i,i,o = '5'o,o (-Bo, 0^1,0.0 + ^0,1^1,04 ^ (^i,o[Co,o]) ) 

= (-1)(1 xO+1 X 1-0) = -1, (A13) 

Ql,l,l = 'S'o^l (-^0,2^^0,2 + -^0,3^1,0,3 ~ (-Sl,l[Co,l]) ) 

= (-1)(1 X 1 + 1 xO-0) = -1, (A14) 



Eqs. dMl) and ([31 




^M,i = I n^,3 \^\^IA QliA 



;).(:),-.^(-}i^^i,. 



(A16) 



and Eq. dH 



'S'l.o — C'l^o ^ BifiLi g — Bi^iLi I 




10 0)1 -1 1=1. (A17) 



Finally updating the precursors X^ g and X^ I of the 
inverse of the matrix X using Eqs. psp and ([M)l yields 



the inverse of X as follows: 



^2,0 - 




'1^0****' 

* 1 * * * 
1 * * * * 

* =K =K * 1 = X^^. (A18) 

***:)< 1 * 
* * * 1 * 

\*0*0**1/ 

The calculated elements in the inverse X~^ are found to 
be consistent with those by conventional methods such 
as the LU method. It is also noted that one can easily 
obtain the corresponding elements in the inverse of the 
original matrix using a table function generated in the 
nested dissection which converts the row or column index 
of the structured matrix to the original one. 

Appendix B: Calculation of selected eigenstates 

In the appendix, it is shown that a few eigenstates 
around a selected energy ^ can be obtained by a similar 
way with the same computational complexity as in the 
calculation for the density matrix, though the proposed 
method directly computes the density matrix without ex- 
plicitly calculating the eigenvectors. 

We compute the few eigenstates around ^ using a block 
shift-invert iterative method in which the generalized 
eigenvalue problem of Eq. (2) is transformed as 



[H-^sy^Sc, 



1 



£i^ -£. 



(Bl) 
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Then, the following iterative procedure yields a set of 
eigenstates around ^ as the convergent result. 



hi = {H-iS)-^Scu 



(B2) 



{hi\H\hi)ci, 



(b,|5|bOc,+ie,+i, (B3) 



where I is the iterative step, e is a square matrix consist- 
ing of diagonal elements, and b and c are a set of vectors 
of which number is that of the selected states. The matrix 
multiplication in Eq. (B2) and the solution of the gener- 
alized eigenvalue problem for Eq. (B3) are repeated until 
convergence, and the convergent c and the diagonal ele- 
ments of e correspond to the eigenstates around S,. If the 
number of selected eigenstates is independent of the size 
of system, the computational cost required for Eq. (B3) 
is 0(iV), which arises from the matrix multiplications of 
(bi\H\hi) and (b;|S'|b;). Therefore, the computational 
cost of the iterative calculation is governed by the ma- 
trix multiplication of (H — £,S)~^yJ in Eq. (B4), where 

yT = sci. 

Here we show that the matrix multiplication of {H — 
^S)^^yf can be performed by a similar way with the 
same computational complexity as in the calculation for 
the density matrix. As an example of {H — ^S), let us 
consider the matrix X given by Eq. ([2^ . After the re- 
currence calculation of Eqs. ([25])- (32), it turns out that 
the matrix X is factorized as 



X I^lI^Q±Jljr\ Ij-i 



(B4) 



with matrices defined by 

/ ^0,0 



D 



Ao,i 



So,o 



^0,2 



Ao,3 



So, 



Si,o J 



Lofi Lo,i Ico,o 



-''-40,2 

-^-40, 3 
^0,2 Lo,3 lCo,i 



IC,,0 I 



and 



ii = 



/ ^Ao , 



-^Ao,! 



^Co^ 



-^-40. 2 



-^^0,3 



where Iaq o stands for an identity matrix with the same 
size as that of the matrix Ao,o, and the same rule applies 
to other cases. Then, we see that the inverse of X is 
given by 



x-^ = {Lir\Lir^D-\ur\L,r^ 

with matrices defined by 

/ ^-4o,o 



(io)-' 



and 



(ii)- 



(B5) 



-io,0 --f'O,! -fCo.o 



-f-4o,3 
—1/0,2 -io,3 /Co.i 



/ -^-40.0 






-^-40,1 



■''Co.o 



-^-40, 2 



V 



--^1,0 



^Aa- 



-Li 



'Co. 



Ic,.o J 



It should be noted that the inverses of Lq and Li are 
remarkably simple, and that the inverse of D is found 
to be a matrix consisting of diagonal block inverses. In 
general cases, we see that a matrix X and its inverse are 
given by 



X ^ Lp ■ ■ ■ LiLqDLq Li ■ ■ ■ L 



Pj 



(B6) 



X-' = {Ll)-^---{L^)-\Llr^D-^ 

x{Lo)-\L,)-^---{Lp)-\ (B7) 

where the inverse Lp is given in a similar form as well as 
those of Lq and Li. 

By considering Eq. (B7) and the simple forms of 
{Lp)~^, the matrix multiplication of X^^y'^ can be 
performed by the following three steps: 



(i) The first step, (y'f = {Lo)-\Li)-^ 
calculated by 



-1,.T 



iLp)-'y 



IS 



(y'i^Cp.J) = -Lp^2n{y[Lp,27i]) - Lp^27i+l{y[Lp,2n+l) 

+iy[ic,Jf, (B8) 

(.y'liAoJf = (yiiAoJf, (B9) 

where p = 0, • • • , P and n = 0, • • • , 2^"^ - 1 in Eq. (B8), 
and n = 0, • • • , 2^ - 1 in Eq. (B9). 

(ii) The second step, (y")"^ = L>^^{y')'^, is calculated by 

iy"[ic,Jf = {Sp,nrHy'[Sp,n]f, (bio) 
iy"[iAoJf = {Ao,rHy'[iAoJf^ (Bii) 
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TABLE III: Computational order of Eqs. 
dHUJ, and (fBTsI - 



B8|l . (IBToIi . (|BTT1i . 



ID 



2D 



Eq. (Ull) 

Eq. (|BT0)| 

Eq. (|BTTt 

Eqs. (|BT2ll +(|BT3ll 



iVlogjiV 

AT 



Ar3/2 



3D 



TV 



5/3 



7V4/3 



A^logaA- 

TV TV 

7^3/2 ^5/3 



where Xg = y", p + 1 = 1, • • • , P + 1, and m = p, • • • , P. 
At the end of the recurrence calculation, we obtain the 
result of the multiplication as 



X-iy^ = xp+i = (H-e5)-V. 



(B15) 



where p — 0, • • • , P and n = 0, • • • , 2^ ^ — 1 
Eq. (BIO), and n == 0, • • • , 2^ - 1 in Eq. (Bll). 

(iii) The third step, {L^)-^ ■ ■ ■ {Lj)-^L^)-Hy")^ , 
performed by the following recurrence formulas: 



(Xp+l[£p,2n]) — (Xp[Pp,2ri] 



(Xp 



-l[Lp^2n+l]) 



(xp+i[/c„,J)' 



- (Lp,2„)^(xp[/c„J)^, 
(B12) 

{Xp[Lp^2n+l]) 

-(ip,2„+i)^(xp[/c„„])^, (B13) 

(xp[/c„,J)^, (B14) 



The computational effort of the three steps can be eas- 
ily estimated by the same way as for the calculation of the 
inverse matrix, and summarized in Table III. It is found 
that the the computational complexity of the three steps 
is lower than that of the calculation of the inverse ma- 
trix. Thus, if the number of selected eigenstates and the 
number of iterations for convergence are independent of 
the size of system, the computational effort of calculation 
of the selected eigenstates is governed by the recurrence 
calculation of Eqs. (l^51) - ([5^ even for the calculation of 
selected eigenstates. The scheme may be useful for cal- 
culation of eigenstates near the Fermi level. 
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